From Toy Datasets to Real-World Chaos
EvoClass-AI002 Lecture 5

1. Building the Bridge: Data Loading Fundamentals

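Deep learning models depend on clean, consistent data, but real-world datasets are inherently messy. We must move from prepackaged benchmarks (such as MNIST) to managing unstructured data sources, where data loading itself becomes a complex orchestration task. The foundation of this process is the set of tools PyTorch provides specifically for data management.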

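The core challenge is converting raw, scattered data stored on disk (images, text, audio files) into the highly organized, standardized PyTorch tensor format that the GPU expects. This requires custom logic for indexing, loading, preprocessing, and finally batching.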

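Key Challenges of Real-World Data

  • Data chaos: Data is scattered across multiple directories, often indexed only by a CSV file.
  • Preprocessing required: Images may need resizing, normalization, or augmentation as part of their conversion to tensors.
  • Efficiency goal: Data must reach the GPU in optimized, non-blocking batches to maximize training speed.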
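To make the preprocessing point concrete, here is a minimal sketch of resizing and normalizing one image tensor. It uses only core `torch` calls. The function name `preprocess`, the random stand-in image, and the choice of per-image standardization are assumptions made for illustration; real pipelines often use dataset-level statistics and a transforms library instead.

```python
import torch
import torch.nn.functional as F


def preprocess(image_uint8: torch.Tensor, size=(224, 224)) -> torch.Tensor:
    """Hypothetical helper: (C, H, W) uint8 image -> resized, standardized float tensor."""
    # Scale pixel values from [0, 255] to [0.0, 1.0].
    x = image_uint8.float() / 255.0
    # interpolate expects a batch dimension, so add one and remove it afterwards.
    x = F.interpolate(
        x.unsqueeze(0), size=size, mode="bilinear", align_corners=False
    ).squeeze(0)
    # Per-image, per-channel standardization (an assumption for this sketch).
    mean = x.mean(dim=(1, 2), keepdim=True)
    std = x.std(dim=(1, 2), keepdim=True).clamp_min(1e-6)
    return (x - mean) / std


# Random stand-in for a decoded image of arbitrary size.
raw = torch.randint(0, 256, (3, 300, 180), dtype=torch.uint8)
out = preprocess(raw)
```

Whatever the input size, the output has one fixed shape, which is what allows samples to be stacked into a batch later.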
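PyTorch's Solution: Separation of Responsibilities
PyTorch emphasizes separation of responsibilities: Dataset handles the "what" (how to access a single sample and its label), while DataLoader handles the "how" (efficient batching, shuffling, and parallel loading with multiple worker processes).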
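The division of labor can be sketched with a small in-memory example. The class name `ToyDataset` and its synthetic data are hypothetical stand-ins; a real dataset would read from disk. `num_workers=0` keeps loading in the main process so the snippet runs anywhere.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical dataset: answers 'what is sample i?' and nothing else."""

    def __init__(self, num_samples: int = 10, num_features: int = 4):
        # Synthetic stand-in data; a real dataset would index files on disk.
        total = num_samples * num_features
        self.features = torch.arange(total, dtype=torch.float32).reshape(
            num_samples, num_features
        )
        self.labels = torch.arange(num_samples) % 2

    def __len__(self) -> int:
        return self.features.shape[0]

    def __getitem__(self, index: int):
        # One sample and its label; no batching logic lives here.
        return self.features[index], self.labels[index]


dataset = ToyDataset()
# The DataLoader owns the 'how': batch size, shuffling, worker processes.
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=0)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
```

With 10 samples and `batch_size=4`, the loader yields two full batches and one partial batch. Raising `num_workers` above 0 moves loading into worker processes; on platforms that spawn processes, the loop should then sit under an `if __name__ == "__main__":` guard.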
Question 1
What is the primary role of a PyTorch Dataset object?
To organize samples into mini-batches and shuffle them.
To define the logic for retrieving a single, preprocessed sample.
To perform the matrix multiplication inside the model.
Question 2
Which DataLoader parameter enables parallel loading of data using multiple CPU cores?
device_transfer
batch_size
num_workers
async_load
Question 3
If your raw images are all different sizes, which component is primarily responsible for resizing them to a uniform dimension (e.g., $224 \times 224$)?
The DataLoader's collate_fn.
The GPU's dedicated image processor.
The Transformation function applied within the Dataset's __getitem__ method.
Challenge: The Custom Image Loader Blueprint
Define the structure needed for real-world image classification.
You are building a CustomDataset for 10,000 images indexed by a single CSV file containing paths and labels.
Step 1
Which mandatory method must return the total number of samples?
Solution:
The __len__ method.
Concept: __len__ tells the DataLoader how many samples make up one epoch.
Step 2
What is the correct order of operations inside __getitem__(self, index)?
Solution:
1. Look up file path using index.
2. Load the raw data (e.g., Image).
3. Apply the necessary transforms.
4. Return the processed Tensor and Label.
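The two steps above can be sketched without any framework, since a map-style dataset only needs `__len__` and `__getitem__`. Everything named here is a hypothetical stand-in: the class `CsvIndexedDataset`, the CSV column names `path` and `label`, the file paths, and `fake_loader`, which replaces real image decoding. In PyTorch code the class would subclass `torch.utils.data.Dataset`, and the transform would return a tensor.

```python
import csv
import os
import tempfile


class CsvIndexedDataset:
    """Map-style dataset sketch: a CSV of (path, label) rows acts as the index."""

    def __init__(self, csv_path, loader, transform=None):
        # Read the whole index up front; the samples themselves are loaded lazily.
        with open(csv_path, newline="") as f:
            self.rows = [(r["path"], int(r["label"])) for r in csv.DictReader(f)]
        self.loader = loader
        self.transform = transform

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, index):
        path, label = self.rows[index]        # 1. look up the file path
        sample = self.loader(path)            # 2. load the raw data
        if self.transform is not None:
            sample = self.transform(sample)   # 3. apply the transforms
        return sample, label                  # 4. return sample and label


def fake_loader(path):
    # Stand-in for image decoding: returns the path length as a one-element list.
    return [float(len(path))]


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "index.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "label"])
        writer.writerow(["images/cat_001.png", 0])
        writer.writerow(["images/dog_001.png", 1])

    dataset = CsvIndexedDataset(
        csv_path, loader=fake_loader, transform=lambda s: [v / 100.0 for v in s]
    )
    num_samples = len(dataset)
    first_sample, first_label = dataset[0]
```

Passing the loader and transform in as callables keeps the indexing logic separate from decoding and preprocessing, so each part can be swapped or tested on its own.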